{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7f0cceea",
   "metadata": {},
   "source": [
    "# Dataloader with Popmon Reports"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e79d9c5",
   "metadata": {},
   "source": [
    "This demo is to cover the usage of popmon with the dataloader from the dataprofiler\n",
    "\n",
    "This demo covers the followings:\n",
    "\n",
    "    - How to install popmon\n",
    "    - Comparison of the dynamic dataloader from dataprofiler to the \n",
    "        standard dataloader used in pandas\n",
    "    - Popmon's usage example using both dataloaders\n",
    "    - Dataprofiler's examples using both dataloaders\n",
    "    - Usage of the pm_stability_report function (popmon reports)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aec2198a",
   "metadata": {},
   "source": [
    "## How to Install Popmon\n",
    "To install popmon you can use the command below:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4383ed2a",
   "metadata": {},
   "source": [
    "`pip3 install popmon`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91dedc34",
   "metadata": {},
   "source": [
    "From here, we can import the libararies needed for this demo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2adec556",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "try:\n",
    "    sys.path.insert(0, '..')\n",
    "    import dataprofiler as dp\n",
    "except ImportError:\n",
    "    import dataprofiler as dp\n",
    "import pandas as pd\n",
    "import popmon  # noqa"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ed532ec",
   "metadata": {},
   "source": [
    "## Comparison of Dataloaders"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cccbf4cd",
   "metadata": {},
   "source": [
    "First, we have the original pandas dataloading which works for specific file types. \n",
    "This is good for if the data format is known ahead of time but is less useful for more dynamic cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96e9ff89",
   "metadata": {},
   "outputs": [],
   "source": [
    "def popmon_dataloader(path, time_index):\n",
    "    # Load pm dataframe (Can only read csvs unless reader option is changed)\n",
    "    if not time_index is None:\n",
    "        pm_data = pd.read_csv(path, parse_dates=[time_index])\n",
    "    else:\n",
    "        time_index = True\n",
    "        pm_data = pd.read_csv(path)\n",
    "    return pm_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16dfbe10",
   "metadata": {},
   "source": [
    "Next, we have the dataprofiler's dataloader. This allows for the dynamic loading of different data formats which is super useful when the data format is not know ahead of time.\n",
    "This is intended to be an improvement on the dataloader standardly used in pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07481259",
   "metadata": {},
   "outputs": [],
   "source": [
    "def dp_dataloader(path):\n",
    "    # Datalaoder from dataprofiler used\n",
    "    dp_data = dp.Data(path)\n",
    "    \n",
    "    # Profiler used to ensure proper label for datetime even \n",
    "    # when null values exist\n",
    "    profiler_options = dp.ProfilerOptions()\n",
    "    profiler_options.set({'*.is_enabled': False,  # Runs first disabling all options in profiler\n",
    "                          '*.datetime.is_enabled': True})\n",
    "    profile = dp.Profiler(dp_data, options=profiler_options)\n",
    "\n",
    "    # convert any time/datetime types from strings to actual datatime type\n",
    "    for ind, col in enumerate(dp_data.data.columns):\n",
    "        if profile.profile[ind].profile.get('data_type') == 'datetime':\n",
    "            dp_data.data[col] = pd.to_datetime(dp_data.data[col])\n",
    "\n",
    "    return dp_data.data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69a8ea9b",
   "metadata": {},
   "source": [
    "## Popmon's usage example using both dataloaders"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff914ca7",
   "metadata": {},
   "source": [
    "Next, we'll download a dataset from the resources component"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bff33da8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gzip\n",
    "import shutil\n",
    "popmon_tutorial_data = popmon.resources.data(\"flight_delays.csv.gz\")\n",
    "with gzip.open(popmon_tutorial_data, 'rb') as f_in:\n",
    "    with open('./flight_delays.csv', 'wb') as f_out:\n",
    "        shutil.copyfileobj(f_in, f_out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19222c4a",
   "metadata": {},
   "source": [
    "Finally we read in the data with popmon and print the report to a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0090a2f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Default csv from popmon example\n",
    "path = \"./flight_delays.csv\"\n",
    "time_index = \"DATE\"\n",
    "report_output_dir = \"./popmon_output/flight_delays_full\"\n",
    "if not os.path.exists(report_output_dir):\n",
    "    os.makedirs(report_output_dir)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0abcd9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "pm_data = popmon_dataloader(path, time_index)\n",
    "\n",
    "report_pm_loader = pm_data.pm_stability_report(\n",
    "    time_axis=time_index,\n",
    "    time_width=\"1w\",\n",
    "    time_offset=\"2015-07-02\",\n",
    "    extended_report=False,\n",
    "    pull_rules={\"*_pull\": [10, 7, -7, -10]},\n",
    ")\n",
    "\n",
    "# Save popmon reports\n",
    "report_pm_loader.to_file(os.path.join(report_output_dir, \"popmon_loader_report.html\"))\n",
    "print(\"Report printed at:\", os.path.join(report_output_dir, \"popmon_loader_report.html\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2303b5cf",
   "metadata": {},
   "source": [
    "We then do the same for the dataprofiler loader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2854383",
   "metadata": {},
   "outputs": [],
   "source": [
    "dp_dataframe = dp_dataloader(path)\n",
    "# Generate pm report using dp dataloader\n",
    "report_dp_loader = dp_dataframe.pm_stability_report(\n",
    "    time_axis=time_index,\n",
    "    time_width=\"1w\",\n",
    "    time_offset=\"2015-07-02\",\n",
    "    extended_report=False,\n",
    "    pull_rules={\"*_pull\": [10, 7, -7, -10]},\n",
    ")\n",
    "\n",
    "# Save popmon reports\n",
    "report_dp_loader.to_file(os.path.join(report_output_dir, \"dataprofiler_loader_report.html\"))\n",
    "print(\"Report printed at:\", os.path.join(report_output_dir, \"dataprofiler_loader_report.html\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cc4e5f3",
   "metadata": {},
   "source": [
    "## Examples of data\n",
    "Next, We'll use some data from the test files of the data profiler to compare the dynamic loading of the dataprofiler's data loader to that of the standard pandas approach. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "352eaeea",
   "metadata": {},
   "source": [
    "## Dataprofiler's examples using both dataloaders"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e99af913",
   "metadata": {},
   "source": [
    "To execute this properly, simply choose one of the 3 examples below and then run the report generation below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80eb601d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Default csv from popmon example (mini version)\n",
    "path = \"../dataprofiler/tests/data/csv/flight_delays.csv\"\n",
    "time_index = \"DATE\"\n",
    "report_output_dir = \"./popmon_output/flight_delays_mini\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c127288",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Random csv from dataprofiler tests\n",
    "path = \"../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv\"\n",
    "time_index = \"datetime\"\n",
    "report_output_dir = \"./popmon_output/aws_honeypot_marx_geo\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6cd5c385",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Random json file from dataprofiler tests\n",
    "path = \"../dataprofiler/tests/data/json/math.json\"\n",
    "\n",
    "time_index = \"data.9\"\n",
    "report_output_dir = \"./popmon_output/math\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec860cb7",
   "metadata": {},
   "source": [
    "Run the block below to create an output directory for your popmon reports."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf21835c",
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists(report_output_dir):\n",
    "    os.makedirs(report_output_dir)\n",
    "dp_dataframe = dp_dataloader(path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "479975a5",
   "metadata": {},
   "source": [
    "## Report comparison"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02a355e7",
   "metadata": {},
   "source": [
    "We generate reports using different sets of data from the dataprofiler and pandas below using dataprofiler's dataloader and popmons report generator\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ce69145",
   "metadata": {},
   "source": [
    "The dataprofiler's dataloader can seemlessly switch between data formats and generate reports with the exact same code in place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0dcb405",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Generate pm report using dp dataloader\n",
    "report_dp_loader = dp_dataframe.pm_stability_report(\n",
    "    time_axis=time_index,\n",
    "    time_width=\"1w\",\n",
    "    time_offset=\"2015-07-02\",\n",
    "    extended_report=False,\n",
    "    pull_rules={\"*_pull\": [10, 7, -7, -10]},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eb0035c",
   "metadata": {},
   "source": [
    "If the dataloaders are valid, you can see the reports and compare them at the output directory specified in the printout below each report generation block (the two code blocks below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "efe7d8d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save dp reports\n",
    "report_dp_loader.to_file(os.path.join(report_output_dir, \"dataprofiler_loader_report.html\"))\n",
    "print(\"Report printed at:\", os.path.join(report_output_dir, \"dataprofiler_loader_report.html\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
